{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Q2MY4-M1zuhV"
   },
   "source": [
    "# Ungraded Lab: Training a Sarcasm Detection Model using Bidirectional LSTMs\n",
    "\n",
    "In this lab, you will revisit the [News Headlines Dataset for Sarcasm Detection](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) dataset and use it to train a Bi-LSTM Model.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jEuDfViGoQKP"
   },
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "sLshUgUtoOWC"
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import matplotlib.pyplot as plt\n",
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "S-AgItE6z80t"
   },
   "source": [
    "## Load the Dataset\n",
    "\n",
    "First, you will download the JSON file and extract the contents into lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "k_Wlz9i10Dmn"
   },
   "outputs": [],
   "source": [
    "# The dataset is already downloaded for you. For downloading you can use the code below.\n",
    "# !wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Pr4R0I240GOh"
   },
   "outputs": [],
   "source": [
    "# Load the JSON file\n",
    "with open(\"./sarcasm.json\", 'r') as f:\n",
    "    datastore = json.load(f)\n",
    "\n",
    "# Initialize the lists\n",
    "sentences = []\n",
    "labels = []\n",
    "\n",
    "# Collect sentences and labels into the lists\n",
    "for item in datastore:\n",
    "    sentences.append(item['headline'])\n",
    "    labels.append(item['is_sarcastic'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0E2uXg8Z9n6n"
   },
   "source": [
    "## Parameters\n",
    "\n",
    "We placed the constant parameters in the cell below so you can easily tweak it later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jApcxifG9jSe"
   },
   "outputs": [],
   "source": [
    "# Number of examples to use for training\n",
    "TRAINING_SIZE = 20000\n",
    "\n",
    "# Vocabulary size of the tokenizer\n",
    "VOCAB_SIZE = 10000\n",
    "\n",
    "# Maximum length of the padded sequences\n",
    "MAX_LENGTH = 32\n",
    "\n",
    "# Type of padding\n",
    "PADDING_TYPE = 'pre'\n",
    "\n",
    "# Specifies how to truncate the sequences\n",
    "TRUNC_TYPE = 'post'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zN9-ojV55UCR"
   },
   "source": [
    "## Split the Dataset\n",
    "\n",
    "You will then split the lists into train and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "50H0ZrJf035i"
   },
   "outputs": [],
   "source": [
    "# Split the sentences\n",
    "train_sentences = sentences[0:TRAINING_SIZE]\n",
    "test_sentences = sentences[TRAINING_SIZE:]\n",
    "\n",
    "# Split the labels\n",
    "train_labels = labels[0:TRAINING_SIZE]\n",
    "test_labels = labels[TRAINING_SIZE:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MYVNY4tE5YbN"
   },
   "source": [
    "## Data preprocessing\n",
    "\n",
    "Next, you will generate the vocabulary and padded sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "C2xJz4hLiW8-"
   },
   "outputs": [],
   "source": [
    "# Instantiate the vectorization layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)\n",
    "\n",
    "# Generate the vocabulary based on the training inputs\n",
    "vectorize_layer.adapt(train_sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rw8sf708-QAs"
   },
   "source": [
    "You will combine the sentences and labels, then put them in a `tf.data.Dataset`. This will let you leverage the `tf.data` pipeline methods you've been using to preprocess the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "667RxU6mikTo"
   },
   "outputs": [],
   "source": [
    "# Put the sentences and labels in a tf.data.Dataset\n",
    "train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences,train_labels))\n",
    "test_dataset = tf.data.Dataset.from_tensor_slices((test_sentences,test_labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-RjsToPZ_STW"
   },
   "source": [
    "You will use the same preprocessing function from the previous lab to generate the padded sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0Tx35pdcp0Ig"
   },
   "outputs": [],
   "source": [
    "def preprocessing_fn(dataset):\n",
    "  '''Generates padded sequences from a tf.data.Dataset'''\n",
    "\n",
    "  # Apply the vectorization layer to the string features\n",
    "  dataset_sequences = dataset.map(\n",
    "      lambda text, label: (vectorize_layer(text), label)\n",
    "      )\n",
    "\n",
    "  # Put all elements in a single ragged batch\n",
    "  dataset_sequences = dataset_sequences.ragged_batch(\n",
    "      batch_size=dataset_sequences.cardinality()\n",
    "      )\n",
    "\n",
    "  # Output a tensor from the single batch. Extract the sequences and labels.\n",
    "  sequences, labels = dataset_sequences.get_single_element()\n",
    "\n",
    "  # Pad the sequences\n",
    "  padded_sequences = tf.keras.utils.pad_sequences(\n",
    "      sequences.numpy(),\n",
    "      maxlen=MAX_LENGTH,\n",
    "      truncating=TRUNC_TYPE,\n",
    "      padding=PADDING_TYPE\n",
    "      )\n",
    "\n",
    "  # Convert back to a tf.data.Dataset\n",
    "  padded_sequences = tf.data.Dataset.from_tensor_slices(padded_sequences)\n",
    "  labels = tf.data.Dataset.from_tensor_slices(labels)\n",
    "\n",
    "  # Combine the padded sequences and labels\n",
    "  dataset_vectorized = tf.data.Dataset.zip(padded_sequences, labels)\n",
    "\n",
    "  return dataset_vectorized"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "54uLivYDqSMA"
   },
   "outputs": [],
   "source": [
    "# Preprocess the train and test data\n",
    "train_dataset_vectorized = train_dataset.apply(preprocessing_fn)\n",
    "test_dataset_vectorized = test_dataset.apply(preprocessing_fn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cgC_oeb9_dPY"
   },
   "source": [
    "It's always good to check a few examples to see if the transformation works as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PFDERsqWqkkS"
   },
   "outputs": [],
   "source": [
    "# View 2 training sequences and its labels\n",
    "for example in train_dataset_vectorized.take(2):\n",
    "  print(example)\n",
    "  print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3kJC5Er9_k0l"
   },
   "source": [
    "Then, you will optimize and batch the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "nrvjR3wdizDn"
   },
   "outputs": [],
   "source": [
    "SHUFFLE_BUFFER_SIZE = 1000\n",
    "PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE\n",
    "BATCH_SIZE = 32\n",
    "\n",
    "# Optimize and batch the datasets for training\n",
    "train_dataset_final = (train_dataset_vectorized\n",
    "                       .cache()\n",
    "                       .shuffle(SHUFFLE_BUFFER_SIZE)\n",
    "                       .prefetch(PREFETCH_BUFFER_SIZE)\n",
    "                       .batch(BATCH_SIZE)\n",
    "                       )\n",
    "\n",
    "test_dataset_final = (test_dataset_vectorized\n",
    "                      .cache()\n",
    "                      .prefetch(PREFETCH_BUFFER_SIZE)\n",
    "                      .batch(BATCH_SIZE)\n",
    "                      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nGLKQBpw5zz8"
   },
   "source": [
    "## Plot Utility"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6CvBW0705yZ6"
   },
   "outputs": [],
   "source": [
    "def plot_loss_acc(history):\n",
    "  '''Plots the training and validation loss and accuracy from a history object'''\n",
    "  acc = history.history['accuracy']\n",
    "  val_acc = history.history['val_accuracy']\n",
    "  loss = history.history['loss']\n",
    "  val_loss = history.history['val_loss']\n",
    "\n",
    "  epochs = range(len(acc))\n",
    "\n",
    "  fig, ax = plt.subplots(1,2, figsize=(12, 6))\n",
    "  ax[0].plot(epochs, acc, 'bo', label='Training accuracy')\n",
    "  ax[0].plot(epochs, val_acc, 'b', label='Validation accuracy')\n",
    "  ax[0].set_title('Training and validation accuracy')\n",
    "  ax[0].set_xlabel('epochs')\n",
    "  ax[0].set_ylabel('accuracy')\n",
    "  ax[0].legend()\n",
    "\n",
    "  ax[1].plot(epochs, loss, 'bo', label='Training Loss')\n",
    "  ax[1].plot(epochs, val_loss, 'b', label='Validation Loss')\n",
    "  ax[1].set_title('Training and validation loss')\n",
    "  ax[1].set_xlabel('epochs')\n",
    "  ax[1].set_ylabel('loss')\n",
    "  ax[1].legend()\n",
    "\n",
    "  plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "o23gJhj95el5"
   },
   "source": [
    "## Build and Compile the Model\n",
    "\n",
    "The architecture here is almost identical to the one you used in the previous lab with the IMDB Reviews. Try to tweak the parameters and see how it affects the training time and accuracy (both training and validation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jGwXGIXvFhXW"
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "EMBEDDING_DIM = 16\n",
    "LSTM_DIM = 32\n",
    "DENSE_DIM = 24\n",
    "\n",
    "# Model Definition with LSTM\n",
    "model_lstm = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(MAX_LENGTH,)),\n",
    "    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),\n",
    "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_DIM)),\n",
    "    tf.keras.layers.Dense(DENSE_DIM, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Set the training parameters\n",
    "model_lstm.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "# Print the model summary\n",
    "model_lstm.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "krcQGm7B5g9A"
   },
   "source": [
    "## Train the Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "nEKV8EMj11BW"
   },
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 10\n",
    "\n",
    "# Train the model\n",
    "history_lstm = model_lstm.fit(train_dataset_final, epochs=NUM_EPOCHS, validation_data=test_dataset_final)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "g9DC6dmLF8DC"
   },
   "outputs": [],
   "source": [
    "# Plot the accuracy and loss\n",
    "plot_loss_acc(history_lstm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrap Up\n",
    "\n",
    "This concludes this lab on using LSTMs for the Sarcasm dataset. You will explore another architecture in the next lab. Before doing so, run the cell below to free up resources."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shutdown the kernel to free up resources. \n",
    "# Note: You can expect a pop-up when you run this cell. You can safely ignore that and just press `Ok`.\n",
    "\n",
    "from IPython import get_ipython\n",
    "\n",
    "k = get_ipython().kernel\n",
    "\n",
    "k.do_shutdown(restart=False)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "C3_W3_Lab_5_sarcasm_with_bi_LSTM.ipynb",
   "private_outputs": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
